Activation of Voice-Controlled Apparatus

Field of the Invention 
The present invention relates to the activation of voice-controlled apparatus.

Background of the Invention 

Voice control of apparatus is becoming more common and there are now well developed
technologies for speech recognition, particularly in contexts that only require small
vocabularies.

However, a problem exists where there are multiple voice-controlled apparatus in close 
proximity since their vocabularies are likely to overlap giving rise to the possibility of 
several different pieces of apparatus responding to the same voice command. 

It is known from US 5,991,726 to provide a proximity sensor on a piece of voice-controlled
industrial machinery or equipment. Activation of the machinery or equipment by voice can
only be effected if a person is standing nearby. However, pieces of industrial machinery or
equipment of the type being considered are generally not closely packed, so that whilst the
proximity sensor has the effect of making voice control specific to the item concerned in
that context, the same would not be true for voice-controlled kitchen appliances, as in the
latter case the detection zones of the proximity sensors are likely to overlap. Furthermore,
in the arrangement described in US 5,991,726, whilst the proximity sensor necessarily only
responds to the presence of a nearby operator, the voice responsive circuits of the machinery
are not configured to respond only to voice input from that operator, giving rise to the
possibility of a shouted command from another operator causing false operation.

With respect to this latter drawback, methods of acoustically locating a sound source are
themselves well known, so it would be possible to ensure that the machinery only responded
to locally-spoken commands. Detecting the location of a sound source is usually done with an
array of microphones; US 5,465,302 and US 6,009,396 both describe sound source location
detecting systems. By determining the location of the sound source, it is then possible to
adjust the processing parameters of the input from the individual microphones of the array
so as to effectively 'focus' the array on the sound source, enabling the sounds emitted from
the source to be picked out from surrounding sounds.

Of course, knowing the location of a speaker issuing a command for a voice-controlled 
device does not, of itself, solve the problem of voice control activating multiple pieces of 
apparatus. One possible solution to this problem is to require each voice command to be 
immediately preceded by speaking the name of the specific apparatus it is wished to 
control so that only that apparatus takes notice of the following command. This approach 
is not, however, user friendly and users frequently forget to follow such a command 
protocol, particularly when in a hurry. 

It is an object of the present invention to provide a more user-friendly way of minimising 
the risk of unwanted activation of multiple voice-controlled apparatus by the same verbal 
command. 

Summary of the Invention 

According to the present invention, there is provided a method of activating
voice-controlled apparatus, comprising the steps of:

(a) - using a microphone array to detect when a user is facing towards the apparatus
      when making a sound;

(b) - enabling the apparatus for voice control only if step (a) indicates that the user is
      simultaneously facing towards the apparatus and making a sound.

Determination of whether the user is facing the voice-controlled apparatus preferably
involves:

(i) - using the microphone array to determine the location of the user,

(ii) - measuring the strength of the sound signal received at each microphone of the
       array, and

(iii) - carrying out processing to effectively orientate a relative signal strength map for
        sounds generated by a human, situated at the determined location of the user, to
        obtain a pattern of relative strengths at the microphones substantially corresponding
        to those measured in step (ii), the map orientation then giving the direction of
        facing of the user.

Advantageously, the microphone array is made up of microphones associated with
respective devices of a set of voice-controlled devices including said voice-controlled
apparatus, the relative locations of the devices being known. The relative locations of the
devices are known, for example, as a result of an automatic set-up process in which each
device is caused to emit a sound at the same time as sending an electric or electromagnetic
signal, the latter acting as a timing point enabling the other devices to determine their
distance from the emitting device. The devices then exchange their distances from other
devices, thereby enabling each device to calculate the relative locations of all devices.

The present invention also encompasses a system and apparatus embodying the foregoing 
method of the invention. 



Brief Description of the Drawings 

A method and system embodying the invention, for controlling activation of voice- 
controlled devices, will now be described, by way of non-limiting example, with reference 
to the accompanying diagrammatic drawings, in which: 

- Figure 1 is a diagram illustrating a room equipped with a microphone array for
  controlling activation of voice-controlled devices in the room;

- Figure 2 is a diagram illustrating the determination of the location of a speaker;

- Figure 3 is a diagram illustrating the determination of the direction of facing of the
  speaker;

- Figure 4 is a diagram illustrating a room in which there are multiple voice-controlled
  devices that cooperate to provide a microphone array; and

- Figure 5 is a diagram illustrating the main sound-related functional capabilities of
  the Figure 4 voice-controlled devices.

Best Mode of Carrying Out the Invention

Figure 1 shows a work space 11 in which a user 10 is present, facing in a direction
indicated by dashed arrow 12. Within the space 11 are three voice-controlled devices 14
(hereinafter referred to as devices A, B and C respectively) each with different
functionality but each provided with a similar voice interface subsystem 15, including
microphone 16, permitting voice control of the device by the user.

The work space 11 is equipped with a set of three fixed room microphones 28 (hereinafter
referred to as microphones M1, M2 and M3) that include digitisers for digitising the sound
picked up, the digitised sound data then being passed via LAN 29 to a device activation
manager 30. Manager 30 incorporates a sound-signal processing unit 33 which determines,
in a manner to be more fully described below, when a user is facing towards a particular
device. When the unit 33 determines that the user is facing towards a device 14 it informs
a control block 34; unit 33 also informs block 34 whenever the user is speaking. Using
this information, the control block 34 decides when to enable the voice interface of a
particular device, and sends appropriate control messages to the devices via infrared links
established between an IR transmitter 35 of the manager and IR receivers 36 of the devices.
The control block 34 ensures that the voice interface of only one device 14 is enabled at
any one time. For convenience, in the following description, the control block 34 will be
described as enabling/disabling the devices rather than their voice interfaces; however, it
should be understood that the devices may have other interfaces (such as manual
interfaces) that are not under the control of the device activation manager.

The control block 34 initially enables a device 14 when block 34 determines, from the
information passed to it by unit 33, that the user is facing towards the device at the time of
first speaking following a period of quiet of at least a predetermined duration. The control
block then maintains enablement of the device concerned whilst the user continues to speak
and for a timeout period thereafter, even if the user faces away from the device; if the user
starts speaking again during the timeout period, the timing out of this period is reset. This
timeout period is, for example, 3 seconds and is shorter than the aforesaid predetermined
period of quiet that is required to precede initial enablement of a device. Thus, even if a
user, whilst talking to a device, turns towards another device and pauses briefly before
speaking again, the control block does not switch to enabling that other device unless the
pause is both longer than the timeout period (resulting in the previously-enabled device
being disabled) and at least as long as the predetermined quiet period (resulting in the
device currently faced being enabled). The timeout period can, in fact, be of the same
duration as the predetermined period of quiet.
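
By way of illustration only, the enablement rules just described can be sketched as a small state machine. The class name `ControlBlock`, the method `update`, and the 4-second quiet period are assumptions made for this sketch; only the 3-second timeout is taken from the example given above.

```python
from dataclasses import dataclass
from typing import Optional

QUIET_PERIOD_S = 4.0  # assumed value for the predetermined period of quiet
TIMEOUT_S = 3.0       # example timeout period from the description


@dataclass
class ControlBlock:
    """Tracks which single device (if any) is currently enabled."""
    enabled: Optional[str] = None
    last_speech: float = float("-inf")  # time at which speech was last heard

    def update(self, now: float, speaking: bool, facing: Optional[str]) -> Optional[str]:
        quiet_for = now - self.last_speech
        # The enabled device is disabled once the timeout has elapsed without speech.
        if self.enabled is not None and quiet_for > TIMEOUT_S:
            self.enabled = None
        if speaking:
            # Initial enablement needs a faced device and a long enough preceding quiet.
            if self.enabled is None and facing is not None and quiet_for >= QUIET_PERIOD_S:
                self.enabled = facing
            # Any speech restarts the timeout period.
            self.last_speech = now
        return self.enabled
```

With these values, a user who turns from device A to device B and pauses for less than 4 seconds does not cause B to be enabled, even though A is disabled after 3 seconds of silence.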

The control block 34 can be arranged to communicate its enablement decisions to the 
devices using any suitable protocol. For example, the control block can simply send an 
enable message to an identified device to enable it (with the other devices recognising that 
the message is not intended for them and ignoring it), and then subsequently send a 
disable message to disable the device. Alternatively, each device can be arranged to require
a continual supply of enable messages (for example, at least one every second) for its voice 
interface to remain enabled, absence of enable messages for greater than this period 
resulting in the voice interface of the device becoming disabled. 
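
The second of these protocols, in which enablement lapses unless refreshed, might be sketched as follows. The class `KeepAliveEnable` and its method names are hypothetical; the one-second maximum gap is the example figure given above.

```python
from typing import Optional


class KeepAliveEnable:
    """Enabled state of one device, kept alive by a continual supply of enable messages."""

    def __init__(self, device_id: str, max_gap_s: float = 1.0):
        self.device_id = device_id
        self.max_gap_s = max_gap_s
        self._last_enable: Optional[float] = None  # arrival time of the latest enable message

    def on_enable_message(self, target_id: str, now: float) -> None:
        # Messages addressed to other devices are recognised as such and ignored.
        if target_id == self.device_id:
            self._last_enable = now

    def is_enabled(self, now: float) -> bool:
        # The interface becomes disabled when no enable message arrived recently enough.
        return self._last_enable is not None and now - self._last_enable <= self.max_gap_s
```

A design consequence of this variant is that a lost disable message cannot leave a device permanently enabled, since silence from the manager is itself treated as a disable.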

In each device, the voice interface 15 comprises, in addition to microphone 16, a speech 
recogniser 17 (see device 14A in Figure 1) and an enable circuit 18. The enable circuit 18 
is fed from the infrared receiver 36 and holds the current enabled/disabled state of the 
voice interface. Circuit 18 enables or disables the speech recogniser 17 according to its 
stored state. When the speech recogniser 17 is enabled, it interprets the voice input picked
up by microphone 16 and generates corresponding control outputs (see arrow 19) for 
controlling the functionality of the device 14. 

Whilst it would be possible to centralise the speech recognition functions of the devices 14 
in the device activation manager 30, this would require the latter to be provided with a
speech recogniser programmed both with the input vocabulary and control language of all 
devices that it may have to control. 

Figures 2 and 3 illustrate how the sound signal processing block 33 determines when user
10 is facing towards a particular device. For purposes of illustration, the user 10 is shown
as being located in a position that is a distance "2Q" from microphone M1, a distance "3Q"
from microphone M2, and a distance "4Q" from microphone M3. The signal processing
block 33 is assumed to know the locations of the microphones M1, M2 and M3.

At time T0, user 10 emits a sound that travels at the speed of sound and reaches the
microphones M1, M2 and M3 at successive times T1, T2 and T3. The sounds picked up by
the microphones are passed to the processing block 33 where they are first correlated and
the values (T2-T1) and (T3-T1) are determined; in the present example:

2(T2-T1) = (T3-T1)

The microphones each have their own internal clock that is used to provide time stamps for
stamping the sound data passed to the processing block 33 to enable the above difference
values to be determined, the offsets between the time clocks of the microphones having
been previously measured by any suitable technique (for example, by each microphone
responding with a time-stamped message a predetermined interval after receiving a trigger
message from the manager 30, the internal processing times at both ends being taken into
account).

A measure of the received sound signal strength at each microphone M1, M2, M3 is also
passed to the processing block.

Of course, the processing block does not know the time T0 when the sound was emitted.
However, by effecting a reverse construction of the sound wave front it is possible to
determine the location of the user. More particularly, at time T1, the sound wave front from
the user:

- has just reached microphone M1,
- is at a minimum distance V(T2-T1) from microphone M2 somewhere on a circle C2
  of this radius centred on M2, and
- is at a minimum distance V(T3-T1) from microphone M3 somewhere on a circle C3 of
  this radius centred on M3,

where V is the velocity of sound. In fact, with respect to microphone M1, the sound wave
front can be generalised to be on a circle C1 of radius V(T1-T1) centred on M1.

If now the three circles are expanded (in effect by going back in time), there will eventually
be a point of intersection of all three circles which corresponds to the location of the user. 
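
The circle-expansion construction can be illustrated numerically. The sketch below does not expand circles literally; it substitutes a brute-force grid search for the point whose distance differences to the microphones equal V(T2-T1) and V(T3-T1), which is the same point at which the three expanded circles meet. The function name `locate`, the grid extent and step, and the value used for the speed of sound are all assumptions of this sketch.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value for air at room temperature


def locate(mics, arrival_times, extent=5.0, step=0.05):
    """Return the grid point (x, y) whose range differences best match the measured delays.

    mics: list of (x, y) microphone positions in metres.
    arrival_times: time stamps T1, T2, T3 in seconds on a common clock.
    """
    t_ref = arrival_times[0]
    # Extra distance travelled to each microphone relative to the first: V(Ti - T1).
    extra = [SPEED_OF_SOUND * (t - t_ref) for t in arrival_times]
    best, best_err = None, float("inf")
    n = int(round(extent / step))
    for i in range(n + 1):
        for j in range(n + 1):
            x, y = i * step, j * step
            d = [math.hypot(x - mx, y - my) for mx, my in mics]
            # At the user's location d[k] - d[0] equals extra[k] for every microphone,
            # i.e. circles of radius d[0] + extra[k] about each microphone intersect there.
            err = sum((d[k] - d[0] - extra[k]) ** 2 for k in range(1, len(mics)))
            if err < best_err:
                best, best_err = (x, y), err
    return best
```

The emission time T0 never appears in the calculation; only the differences between arrival times are used, as in the description above.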

It will be appreciated that the foregoing description of how the location of the user is
determined has been kept simple for reasons of clarity. Where the environment 11 is noisy
or reverberating, more complicated signal processing is required to provide a reasonable
location determination and appropriate techniques are described in the afore-mentioned US
patents.

Once the location of the user has been determined, the next step is to derive the direction of
facing of the user. For this purpose, the processing block 33 holds data representing a
relative sound signal strength map 40 (see contour set centred on user 10 in Figure 3) that
indicates the relative sound signal strengths for sounds emitted by a user relative to their
direction of facing, here indicated by dashed arrow 41. The processing block 33 is arranged
to carry out calculations corresponding to placing the origin of the map 40 at the determined
location of the user and determining the relative sound signal strengths at the microphones
M1, M2 and M3 as the map 40 and microphone array are rotated relative to each other.
These relative signal strength readings are then compared with the actual signal strength
readings provided by the microphones M1, M2 and M3 to derive a 'best match' orientation
of the map and thus a direction of facing of the user.
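
A minimal sketch of this 'best match' rotation is given below. The shape of the strength map (`directivity`, loudest straight ahead and weakest behind) and the inverse-distance falloff are assumptions chosen only to make the example concrete; they are not taken from map 40 itself, and the function names are hypothetical.

```python
import math


def directivity(offset_rad):
    """Assumed relative strength map: loudest straight ahead, weakest directly behind."""
    return 0.6 + 0.4 * math.cos(offset_rad)


def predicted_strengths(user, facing_rad, mics):
    """Relative strengths expected at each microphone for a given direction of facing."""
    raw = []
    for mx, my in mics:
        bearing = math.atan2(my - user[1], mx - user[0])
        dist = math.hypot(mx - user[0], my - user[1])
        raw.append(directivity(bearing - facing_rad) / dist)  # assumed 1/distance falloff
    total = sum(raw)
    return [r / total for r in raw]


def facing_direction(user, mics, measured, step_deg=1):
    """Rotate the map in step_deg increments; return the best-matching angle in degrees."""
    total = sum(measured)
    relative = [m / total for m in measured]  # only relative strengths are compared
    best_deg, best_err = 0, float("inf")
    for deg in range(0, 360, step_deg):
        predicted = predicted_strengths(user, math.radians(deg), mics)
        err = sum((p - r) ** 2 for p, r in zip(predicted, relative))
        if err < best_err:
            best_deg, best_err = deg, err
    return best_deg
```

Because the measured values are normalised before comparison, the result does not depend on how loudly the user speaks.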

The derived direction of facing of the user is then used, together with the user's current 
location, to determine whether the user is facing, at least generally, towards any one of the 
devices 14, the locations of which are known to the processing block. The margin of error
in facing direction permitted in deciding whether a user is facing a device will depend, in 
part, on the angular separation of adjacent devices 14. 
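
One possible way of applying such a margin is sketched below. Taking the margin as half the smallest angular gap between adjacent devices, capped at 45 degrees, is purely an assumption of this sketch, as are the function names.

```python
import math


def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def faced_device(user, facing_deg, devices):
    """Return the name of the device the user is generally facing, or None.

    devices: dict mapping device name to its (x, y) location.
    """
    bearings = {
        name: math.degrees(math.atan2(y - user[1], x - user[0])) % 360.0
        for name, (x, y) in devices.items()
    }
    ordered = sorted(bearings.values())
    gaps = [(ordered[(i + 1) % len(ordered)] - b) % 360.0 for i, b in enumerate(ordered)]
    # Assumed policy: half the smallest gap between adjacent devices, at most 45 degrees.
    margin = min(min(gaps) / 2.0, 45.0) if len(ordered) > 1 else 45.0
    name, bearing = min(bearings.items(), key=lambda kv: angle_diff(kv[1], facing_deg))
    return name if angle_diff(bearing, facing_deg) <= margin else None
```

With devices at bearings of 0, 45 and 90 degrees from the user, the margin works out at 22.5 degrees, so no facing direction can be attributed to two devices at once.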

The signal processing block 33 passes its conclusions to control block 34 for the latter to 
control the devices in the manner already described. 

The map 40 will generally be for words spoken by the user 10. However, device 
enablement can be made dependent on the user making some other distinctive sound, such 
as a hand clap, in which case the map should be a relative sound signal strength map for a
hand clap made by a person to their front.



Figure 4 shows a second embodiment which is similar to the first embodiment but now the
microphone array used to determine whether a user is facing a device is made up of the
microphones 16 of the individual devices, the devices being equipped with short-range
transceivers 56 (such as Bluetooth radio transceivers) for exchanging microphone data and
thereby effectively coupling the microphones 16 into an array. The microphone data is
time-stamped as with the Figure 1 embodiment with the relative offsets of the internal
time-stamping clocks of the devices 14 being determined in any appropriate manner.

Furthermore, sound signal processing is now effected in each device by a sound functions
control block 57 that directly determines whether the device should be enabled or disabled
and controls the speech recognition unit 17 accordingly. Thus, if a user is facing towards
device C and starts to speak (after a period of quiet longer than the aforesaid predetermined
period of quiet), the microphone 16 at each of the three devices picks up this sound, which
is digitised and its strength measured, and the block 57 of the device sends on this data to
the other devices and receives their corresponding data. Each block 57, which already knows
the relative locations of the devices 14, now carries out a determination of the user's
location and direction of facing and, as a result, determines whether the user is facing
towards the device concerned. If a device decides that it is being addressed by the user it
first informs the other devices, via the short-range transceiver, that it intends to enable its
speech recogniser. Assuming that no conflict response is received back within a short
window period, the block 57 proceeds to enable its associated speech recogniser 17.
Preferably, the latter is fronted by a FIFO sound data store continuously fed from
microphone 16 so that speech received from the user during the initial enablement
determinations made by block 57 is not lost but is available for interpretation upon the
speech recognition unit being enabled.
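
The FIFO sound data store might, for example, be realised as a bounded queue of audio frames. The class name `SoundFifo` and the frame-based interface are assumptions of this sketch.

```python
from collections import deque


class SoundFifo:
    """Keeps the most recent audio frames so that none are lost while enablement is decided."""

    def __init__(self, max_frames: int):
        # A bounded deque discards the oldest frame automatically once it is full.
        self._frames = deque(maxlen=max_frames)

    def push(self, frame: bytes) -> None:
        """Called continuously with each new frame from the microphone."""
        self._frames.append(frame)

    def drain(self) -> list:
        """Return all buffered frames, oldest first, and empty the store."""
        frames = list(self._frames)
        self._frames.clear()
        return frames
```

On enablement, the recogniser would first be fed the result of `drain()` and thereafter the live microphone frames, so the opening words of a command are not cut off.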

To avoid excessive transmission of sound data between devices, the blocks 57 are arranged
only to send the digitised sound and related signal strength measurements when there is a
possibility of a device being newly enabled - that is, not during periods when a device is
enabled. For simplicity, the blocks 57 are arranged to send the microphone data only after a
period of quiet at least as long as said predetermined period of quiet and prior to one of the
devices informing the others that it has enabled its speech recogniser.

Doing away with the fixed infrastructure and providing the devices with the means for
cooperating with each other in effecting sound control functions results in a very flexible
arrangement. This flexibility is significantly enhanced by arranging for the devices to
automatically calibrate themselves with regard to their mutual existence and locations. This
becomes possible where there are at least three devices in the same space 11.

More particularly, assume that the devices of Figure 4 initially know nothing of each other.
However, each is equipped with a loudspeaker for emitting a preferably distinctive "mating
call" at random periodic intervals. At the same time as emitting its mating call, a device
also sends out a mating signal via its short-range transceiver. This mating signal is detected
by the other devices and if this signal is subsequently complemented by the receipt of the
sound mating call received through the microphone of the device, then the device responds
to the originating device over the short-range transceiver. In this manner, the devices can
establish what other devices are within sound range and form a local group. Using sound
proximity to define this group is less likely to result in the group being spread across
different rooms than if the short-range transceivers had been used for this purpose.
Preferably, each device hearing the original sound is also required to emit its own mating
call and signal in turn to ensure that all devices hearing the initial sound can also hear each
other; devices that can only be heard by some but not all the other devices are
excluded/included in the group of devices according to a predetermined policy.

At this time, a 'pecking' order of member devices of the group can also be determined to
provide a degree of order regarding, for example, the order of transmission of messages. In
this respect, it may be advantageous to employ a collision and back-off policy with respect
to the initial mating call somewhat akin to that used for CSMA/CD data networks. The
device first to successfully transmit its mating call can be made the group leader and can,
for example, be given the responsibility of setting the pecking order within the group.

Once group membership has been established, the devices take it in turn to simultaneously
send their mating call and mating signal again. This time the mating signal is used as a
timing mark against which the other devices can determine the time of travel of the mating
call from the emitting device (it being assumed that the mating signal arrives effectively
instantaneously at all devices). This enables each device to determine its distance from the
emitting device. By repeating this exercise for all devices in turn and by having the
devices exchange their distance data, it becomes possible for the block 57 of each device to
calculate the relative positions of all devices in the group.
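
For a group of three devices, the calculation can be sketched as below. The function names, the choice of coordinate frame (device A at the origin, device B on the positive x axis) and the value for the speed of sound are assumptions of this sketch; the distances alone cannot distinguish an arrangement from its mirror image, so device C is arbitrarily placed on the positive-y side.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value for air at room temperature


def distance_from_delay(t_signal, t_sound):
    """Distance to the emitting device, treating the mating signal as arriving instantly."""
    return SPEED_OF_SOUND * (t_sound - t_signal)


def relative_positions(d_ab, d_ac, d_bc):
    """Positions of devices A, B and C in a frame with A at the origin and B on the x axis."""
    # Law of cosines gives the x coordinate of C; its y coordinate follows from d_ac.
    cx = (d_ac ** 2 + d_ab ** 2 - d_bc ** 2) / (2.0 * d_ab)
    cy = math.sqrt(max(d_ac ** 2 - cx ** 2, 0.0))  # clamp guards against measurement noise
    return {"A": (0.0, 0.0), "B": (d_ab, 0.0), "C": (cx, cy)}
```

For example, mutual distances of 3, 4 and 5 metres place C at (0, 4) in this frame.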

These two operations of determining group composition (and pecking order) and device
locations are represented by steps 60 and 61 in Figure 5 and comprise an automatic set-up
phase for the group of devices. Since devices may be added or removed at any time, the
devices are preferably arranged to initiate a fresh set-up phase at intervals by the emitting
of their mating calls and signals at a random time after the preceding execution of the
set-up phase.

The steps 60 and 61 can in part be combined, with each device only emitting its mating
call and signal a single time.

Following the set up phase, the devices are ready to perform their sound-regulated device 
enablement role as already described with reference to Figure 4, this role involving each
device carrying out the tasks of detecting user input (step 62 in Figure 5), the determination 
of user location and direction of facing (step 63), and the activation of self if being 
addressed (step 64). 

A further role that the devices can usefully perform is the announcement to a user of their
existence upon an appropriate prompt being generated, such as a user clapping their hands
or a door sensor emitting a signal (e.g. via a short-range transmitter) upon a user entering
the space 11. For this role, the devices are equipped to detect the prompt signal and, in the
case that the prompt is a sound, step 62 involves determining whether a detected sound is
a prompt or some other sound. If the devices detect a prompt, then they each announce
their presence through loudspeaker 55, this being done in turn. The order of announcement
can be done according to the pre-established pecking order or can be done in a clockwise
(or anticlockwise) order starting at a particular device and having regard to the user's
position. The user's position is determined by the devices in step 65 in the same way as it
would be for device enablement if the prompt is a sound; if the prompt is some other signal
generated upon user entry into the room, then this fixed position can be arranged to be
made known previously to the devices (for example, a special portable "door device" can
be positioned in the doorway and caused to trigger a new set-up phase in which its position
and nature become known to the other group members, and even though the door device
itself may not be present when the next set-up phase is triggered, the door position is
thereafter retained in memory by the devices).

The group leader can be designated always to start the announcement sequence (step 66)
with each device then announcing when its turn comes (to detect this, the devices must
listen to the other devices announcing, each device preferably leaving a clear gap before
starting its announcement). If the user is detected in step 65 as facing towards a specific
device, then that device, rather than the group leader, can be arranged to be the first device
to announce.
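
A clockwise announcement order relative to the user can be derived from the device locations as sketched below. The function name `clockwise_order` is hypothetical, and the sketch assumes a plan view with the y axis pointing up, so that clockwise corresponds to a decreasing mathematical angle.

```python
import math


def clockwise_order(user, devices, start):
    """Order device names clockwise as seen from the user, beginning with `start`.

    devices: dict mapping device name to its (x, y) location.
    """
    def bearing(name):
        x, y = devices[name]
        return math.atan2(y - user[1], x - user[0])

    start_bearing = bearing(start)

    def clockwise_offset(name):
        # How far the device lies clockwise of the starting device, in [0, 2*pi).
        return (start_bearing - bearing(name)) % (2.0 * math.pi)

    return sorted(devices, key=clockwise_offset)
```

Passing the group leader as `start` gives the default sequence, while passing the device the user is facing gives the variant in which that device announces first.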

Many other variants are, of course, possible to the arrangement described above. For
example, a device can be arranged to be enabled only whilst the user is actually facing it.
Alternatively, initial enablement of a device can require the speaking of a key word
identifying that device whilst the user is facing the device; in this case the device can be
arranged to remain enabled until a key word associated with another device is spoken whilst
the user faces that device. In this case, the speech recogniser of each device must be
continuously enabled, only its output 19 being subject to control.

Various of the processes carried out by the devices 14, particularly the devices 14 of
Figure 4, can be effected independently of the task of enabling voice control of the devices.
Thus, determining the direction of facing of a user can be done for other reasons such as to
determine where to activate a visual alarm indicator to catch the attention of the user.
Furthermore, the auto set-up process for the Figure 4 devices can be effected independently
of the enablement method as can the process for establishing the members of the local
group of devices, and the process for ordering announcements to occur in a clockwise or
anticlockwise sequence relative to the user.

CLAIMS

1. A method of activating voice-controlled apparatus, comprising the steps of:

(a) - using a microphone array to detect when a user is facing towards the apparatus
      when making a sound;

(b) - enabling the apparatus for voice control only if step (a) indicates that the user is
      simultaneously facing towards the apparatus and making a sound.

2. A method according to claim 1, wherein determination of whether the user is facing the
voice-controlled apparatus involves:

(i) - using the microphone array to determine the location of the user,

(ii) - measuring the strength of the sound signal received at each microphone of the
       array, and

(iii) - carrying out processing to effectively orientate a relative signal strength map for
        sounds generated by a human, situated at the determined location of the user, to
        obtain a pattern of relative strengths at the microphones substantially corresponding
        to those measured in step (ii), the map orientation then giving the direction of
        facing of the user.

3. A method according to claim 1 or claim 2, wherein the microphone array is a fixed array 
separate from the apparatus, the relative locations of the apparatus and the microphones of 
the apparatus being known. 

4. A method according to claim 3, wherein the recognition of voice commands is carried 
out at the apparatus. 

5. A method according to any one of the preceding claims, wherein the microphone array 
is made up of microphones associated with respective devices of a group of voice- 
controlled devices including said voice-controlled apparatus, the relative locations of the 
devices being known. 

6. A method according to claim 5, wherein the relative locations of the devices are known
as a result of an automatic set-up process in which each device is caused to emit a sound at
the same time as sending an electric or electromagnetic signal, the latter acting as a timing
point enabling the other devices to determine their distance from the emitting device, the
devices exchanging their distances from other devices whereby to enable each device to
calculate the relative locations of all devices.



7. A method according to any one of the preceding claims, wherein the apparatus, after
being initially enabled for voice control, continues to be so enabled following the user
ceasing to face towards the apparatus but only whilst the user continues to speak and for a
limited timeout period thereafter, recommencement of speaking during this period
continuing voice control with timing of the timeout period being reset.

8. A method according to any one of claims 1 to 6, wherein the apparatus only remains
enabled for voice control whilst the user is facing the apparatus.

9. A method according to any one of the preceding claims, wherein speech recognition
means of the apparatus ignores voice input from the user unless, whilst the user is facing
towards the apparatus, the user speaks a predetermined key word.

10. A method according to any one of the preceding claims, wherein the apparatus is only 
enabled in step (b) if there is at least a predetermined period of quiet immediately before 
the user produces a sound whilst facing the apparatus. 

11. An auto-location method for a plurality of voice-controlled devices equipped with
respective microphones and electric or electromagnetic communication means, wherein
each device is caused to emit a sound at the same time as sending an electric or
electromagnetic signal, the latter acting as a timing point enabling the other devices to
determine their distance from the emitting device, the devices exchanging their distances
from other devices whereby to enable each device to calculate the relative locations of all
devices.

12. A group determination method for a plurality of voice-controlled devices equipped
with respective microphones and electric or electromagnetic communication means, 
wherein a said device is caused to emit a sound at the same time as sending an electric or 
electromagnetic signal, any other said device receiving both the sound and the signal 
responding to the emitting device to indicate its presence and being thereby included as a 
member of a group also including the emitting device. 

13. An announcement method for a plurality of devices, wherein the devices, knowing their
relative locations, and knowing or determining the position of a user when the latter
produces an announcement prompt, take respective turns to make sound announcements
about themselves in an order that proceeds clockwise or anticlockwise with respect to the
user.




ABSTRACT 
Activation of Voice-Controlled Apparatus 

A method of activating a voice-controlled device (14) is provided which minimises the 
risk of activating more than one such device at a time where multiple voice-controlled 
devices exist in close proximity. To activate a specific device, a user (10) is required both 
to be facing the device (14) and speaking at the same time. The device is then activated,
preferably only whilst the speaking continues and for a limited period thereafter. Detection
of whether the user is looking at the device is effected using a microphone array. This array 
is, for example, composed of the microphones (16) of the individual devices (14), the 
devices being provided with short-range transceivers (56) for passing microphone data 
between them. 